ggplot2 intro

ggplot2 is part of the tidyverse. The “gg” stands for “grammar of graphics”: you build a plot from separate elements that you add as layers, and, much as verbs and adjectives change the meaning of a sentence, each layer changes the visualization in some way. Base R graphics do not let you modify a plot simply by adding elements with +. There is of course no shame in using the base plotting functions; sometimes they offer more freedom and are handier. Often, however, ggplot2 handles complex visualizations more quickly, and we can exploit that to do more with our graphs.

This is the general structure (r4ds)

A group of zebra finches. Image from “Accelerated Growth following Poor Early Nutrition Impairs Later Learning”, Gross L, PLoS Biology Vol. 4/8/2006, e270. https://dx.doi.org/10.1371/journal.pbio.0040270, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1479112

Basic plot structure

Below is the basic ggplot2 template.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

We can specify a number of aspects with this grammar.

Main elements of the ggplot2 grammar of graphics:

* Data: acceptable formats are data.frame or tibble
* Geometry: geom_ functions, like geom_point() and geom_line()
* Stats: stat_ functions for statistical transformations, like stat_summary()
* Aesthetics: aes(), for mapping variables to visual properties
* Facets: facet_ functions for creating multi-panel plots, facet_wrap() or facet_grid()
* Coordinates: coord_ functions for adjusting the coordinate system, e.g. coord_flip(); axis scales are changed with scale_ functions such as scale_x_log10()
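
To make the template concrete, here is a minimal sketch that fills each slot once, using the mpg data that ships with ggplot2. The choice of variables and the object name p_grammar are illustrative assumptions and are not part of the finch analysis.

```r
library(ggplot2)

# Data: mpg is a tibble bundled with ggplot2
p_grammar <- ggplot(data = mpg) +
  # Geometry + aesthetics: one point per car, engine size vs highway mileage
  geom_point(mapping = aes(x = displ, y = hwy), stat = "identity") +
  # Stats: overlay the mean hwy computed at each displ value
  stat_summary(
    mapping = aes(x = displ, y = hwy),
    fun = mean,
    geom = "line",
    colour = "red"
  ) +
  # Coordinates: swap the x and y axes
  coord_flip() +
  # Facets: one panel per drive type
  facet_wrap(~ drv)

p_grammar
```

Each line after the first + can be removed or swapped independently, which is the practical meaning of building a plot by layers.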

Loading our finch dataset

We are going to use the famous Darwin’s finch beak dataset to explain how ggplot2 works.

See the data card for the Darwin’s Finch Evolution Dataset on Kaggle.

(We are taking inspiration from the Python analysis here.)

# let's load the tidyverse, which holds ggplot
library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#setwd('./03_session')

# let's load the finch data; it actually comes as four datasets
beak_t1 <- read.csv("./data/finch_beaks_1975.csv")     # beak measurements in 1975 for scandens / fortis
beak_t2 <- read.csv("./data/finch_beaks_2012.csv")     # beak measurements in 2012 for scandens / fortis
# fortis <- read.csv("./data/fortis_beak_depth_heredity.csv")      # species heredity info
# scandens <- read.csv("./data/scandens_beak_depth_heredity.csv")    # other species heredity info

How do we combine / harmonize datasets into one?

Following up on our data wrangling worksheets, we will find plotting a lot handier if we first combine / harmonize the datasets into one.

# Add a new variable to each dataset to indicate the year
beak_t1 <- beak_t1 %>% mutate(year = 1975)
beak_t2 <- beak_t2 %>% mutate(year = 2012)

# Combine the datasets using bind_rows (dplyr::bind_rows())
finch_data <- bind_rows(beak_t1, beak_t2)

# View the combined dataset
head(finch_data)
##   band species Beak_length.mm Beak_depth.mm year
## 1    2  fortis            9.4           8.0 1975
## 2    9  fortis            9.2           8.3 1975
## 3   12  fortis            9.5           7.5 1975
## 4   15  fortis            9.5           8.0 1975
## 5  305  fortis           11.5           9.9 1975
## 6  307  fortis           11.1           8.6 1975
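
As a quick sanity check of what mutate() followed by bind_rows() does, the sketch below repeats the same steps on two tiny data frames. The objects toy_1975, toy_2012, toy_all and toy_counts are hypothetical stand-ins with made-up values; they only borrow the column names of the real files.

```r
library(dplyr)

# Hypothetical stand-ins for the two yearly files (values are made up)
toy_1975 <- data.frame(
  band = c(1, 2),
  species = c("fortis", "scandens"),
  Beak_length.mm = c(9.4, 13.9),
  Beak_depth.mm = c(8.0, 8.4)
)
toy_2012 <- data.frame(
  band = c(3, 4, 5),
  species = c("fortis", "fortis", "scandens"),
  Beak_length.mm = c(10.0, 12.5, 14.3),
  Beak_depth.mm = c(8.5, 8.9, 9.4)
)

# Tag each data frame with its year, then stack them row-wise
toy_all <- bind_rows(
  toy_1975 %>% mutate(year = 1975),
  toy_2012 %>% mutate(year = 2012)
)

# Count rows per species and year to confirm that no rows were lost
toy_counts <- toy_all %>% count(species, year)
toy_counts
```

On the real data, the same check would be finch_data %>% count(species, year).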

Basic scatterplot

Let’s start with a simple scatterplot to explore the relationship between beak length and depth.

# check the dataset is of the right format: here data.frame
class(finch_data)
## [1] "data.frame"
# let's select the fortis data to plot
fortis_data <- finch_data %>% filter(species == "fortis")

# we start by specifying the dataset
ggplot(fortis_data, 
       # aesthetic layer: we specify what is the x-axis variable, and the y-axis variable
       aes(x = Beak_length.mm, y = Beak_depth.mm)
       ) + 
       # we specify that we want dots
       geom_point()

# Try plotting the 'scandens' data now based on the above

Grouping

We can also readily display the data for scandens and fortis in the same plot (a bit messy) by mapping species to a grouping aesthetic such as colour. The group aesthetic on its own has no visible effect with geom_point(). This lets us see the wide differences in beak shape between species.

ggplot(finch_data, 
       # aesthetic layer: we separate groups by colour
       aes(x = Beak_length.mm, y = Beak_depth.mm, colour = species)) + 
       geom_point()
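
The difference between group and colour becomes visible with geoms that draw one object per group, such as geom_smooth(). Here is a minimal sketch on the built-in mpg data; the variable choices and the names p_group and p_colour are illustrative assumptions.

```r
library(ggplot2)

# group: one trend line per drive type, all drawn in the same colour
p_group <- ggplot(mpg, aes(x = displ, y = hwy, group = drv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# colour: the same grouping, but each drive type also gets its own colour and a legend
p_colour <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p_group
p_colour
```

In both plots the points are unaffected by group; only the fitted lines are split, and only colour makes the split readable in the points themselves.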

Adding color to represent Volume

Try to produce a boxplot of beak lengths by species with geom_boxplot().

Faceting

What if we now want to split by year too to explore (relatively) short term evolution? Faceting allows you to create multiple panels for subsets of your data.
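
A minimal sketch of facet_wrap(), shown on the built-in mpg data so that it runs on its own (mpg happens to have a year column too; the name p_wrap is an assumption for illustration):

```r
library(ggplot2)

# One panel per model year; colour still separates the groups within each panel
p_wrap <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  facet_wrap(~ year)

p_wrap
```

With the combined finch data, the analogous step would presumably be to add facet_wrap(~ year) to the species-coloured scatterplot of finch_data.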

You can also create a grid of facets. This way we can create a 2D grid with one factor determining the row and another the column.
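
A minimal sketch of a facet grid, again on the built-in mpg data (the name p_grid and the choice of drv and year as facetting variables are assumptions for illustration):

```r
library(ggplot2)

# Rows are defined by drive type, columns by model year
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ year)

p_grid
```

The formula reads rows ~ columns, so for the finch data facet_grid(species ~ year) would put one species per row and one year per column.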

Adding Labels

Labels are essential for making your plots informative.

ggplot(
  fortis_data, # the fortis subset created earlier
  aes(x = Beak_length.mm, y = Beak_depth.mm)) + 
  geom_point() + 
  # Adding labels to a plot 
  labs( 
    x = "Beak length in mm",   # x-axis label
    y = "Beak depth in mm",    # y-axis label
    title = "Finch Beak Length by Beak Depth in Fortis",       
    caption = "Source: Kaggle Finch Dataset" # a good place for the source of the data, for instance
    )

You can also label individual points:

# Label points with geom_text 
ggplot(mpg, aes(x = displ, y = hwy, label = model)) + geom_text()

Saving plots

Use ggsave() to save your plots at high resolution. By default, it saves the last plot that was displayed.

# Save the plot as a PNG file

ggsave("finch_plot.png", dpi = 600, height = 4, width = 5, units = "in")
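
If you prefer not to rely on the last plot displayed, you can store the plot in an object and pass it through the plot argument of ggsave(). A minimal sketch follows; the names p_save and out_file and the temporary output location are assumptions for illustration.

```r
library(ggplot2)

# Build and store the plot instead of relying on the last plot displayed
p_save <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Hypothetical output location: a temporary folder, so nothing in the project is overwritten
out_file <- file.path(tempdir(), "example_plot.png")

# width and height are interpreted in the given units; dpi sets the resolution
ggsave(out_file, plot = p_save, dpi = 300, height = 4, width = 5, units = "in")

# Confirm that the file was written
file.exists(out_file)
```

Naming the plot explicitly is safer in scripts, where an intermediate plot may have been displayed after the one you meant to save.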

Themes

Themes allow you to customize the appearance of your plots in one go.

# Apply a black-and-white theme 
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  theme_bw()  

We can further customize themes. Here by removing grid lines and increasing font size.

# Apply a black-and-white theme with a larger base font size and no grid lines
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  theme_bw(base_size = 16) + 
  theme( 
    # remove grid lines by setting the grid elements to element_blank()
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank() 
    )

You can also change the font. Being able to change the font size of all elements of a plot with a single argument (base_size) is really handy when preparing plots for posters or publications.

# Change font to serif 
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(size=8, alpha=.2) + 
  theme_bw(base_size = 30) + 
  theme(text = element_text(family = "serif"))

# Exercise

# Check ?theme_bw() and ?theme() to find out about theme_bw() and general theme parameters
# 1. assign the plot to a handle h
# 2. try different pre-defined themes with h+theme_*
# 3. change the default font family to "sans" (sans serif)
# 4. change the "face" of the font to "italic"
# 5. change the font size and dots' size so it is visible from 3 meters away (hint: size parameter for dots) 
# 6. add transparency to the dots so we can appreciate their density (hint: alpha parameter for dots)
# Alpha adjusts the transparency with a range of 0-1 with 1 being entirely opaque, 0 invisible

Bringing it all together

Create xxx in the chunk below.

# Try using the formatting conventions
# remember the + separating graph layers and the , separating parameters within a function

Work on the following graph.

# Change font to serif 
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(size=8, alpha=.2) + 
  theme_bw(base_size = 30) + 
  theme(text = element_text(family = "serif"))

# Exercise

# Check ?theme_bw() and ?theme() to find out about theme_bw() and general theme parameters
# 1. assign the plot to a handle h
# 2. try different pre-defined themes with h+theme_*
# 3. change the default font family to "sans" (sans serif)
# 4. change the "face" of the font to "italic"
# 5. change the font size and dots' size so it is visible from 3 meters away (hint: size parameter for dots) 
# 6. add transparency to the dots so we can appreciate their density (hint: alpha parameter for dots)
# Alpha adjusts the transparency with a range of 0-1 with 1 being entirely opaque, 0 invisible